### Make sure all the packages are installed
if (!requireNamespace("Seurat", quietly = TRUE))
install.packages("Seurat")
if (!requireNamespace("tidyverse", quietly = TRUE))
install.packages("tidyverse")
if (!requireNamespace("colorBlindness", quietly = TRUE))
install.packages("colorBlindness")
if (!requireNamespace("RColorBrewer", quietly = TRUE))
install.packages("RColorBrewer")
if (!requireNamespace("DT", quietly = TRUE))
install.packages("DT")
if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
BiocManager::install("ComplexHeatmap")
if (!requireNamespace("openxlsx", quietly = TRUE))
BiocManager::install("openxlsx")
if (!requireNamespace("presto", quietly = TRUE))
devtools::install_github("immunogenomics/presto")
if (!requireNamespace("SeuratData", quietly = TRUE))
devtools::install_github('satijalab/seurat-data')
if (!requireNamespace("ggalluvial", quietly = TRUE))
install.packages("ggalluvial")
### Load all the necessary libraries
library(Seurat)
library(tidyverse)
library(colorBlindness)
library(RColorBrewer)
library(DT)
library(ComplexHeatmap)
library(SeuratData)
library(openxlsx)
library(ggalluvial)
set.seed(687)8 - Differential Gene Expression & Level 1 Annotation
Introduction
In this notebook we are going to pick up from the clustering notebook and focus on how to annotate a single-cell dataset. We’re briefly going to touch on how to carry out differential expression between clusters but for a more in depth explanation visit our notebook from a previous workshop. Annotating single-cell data is the most laborious part of the analysis. It requires in depth knowledge of the cell types making up your biological system and multiple rounds of iterative clustering and subsetting to get to a fine grained annotation. Luckily, there are some tools, like label transfer, that can be used to help speed up this process. These tools rely on using a relevant and previously annotated reference to help automate this annotation process. These automated methods work great for coarse cell type labels but don’t perform quite as well when annotating fine grained cell states that may be not be found in our reference. Therefore, an approach combining both methodologies is usually used. A good resource going over this process is the paper by Luecken, M, et al..
Key Takeaways
Differential Gene Expression
To annotate our clusters, we need to determine which genes are differentially expressed in each one.
Differentially expressed genes depend on which cell types we are comparing. The same cell type will have different differentially expressed genes if we change the other cell types in the dataset.
We can quantify these differentially expressed genes using effect size and discriminatory power metrics such as log2FC and AUC.
P values obtained from carrying out DGE analysis between clusters are inflated and should not be used.
Annotation
Annotation is a laborious process that uses automated and manual approach in which we use automated methods to coarsely label cells and manual annotation to identify unique cell states.
Automated annotation requires relevant reference data that has been previously annotated. We will only be able to identify cell types that were annotated in the reference.
Manual annotation is based on literature knowledge and digging through previous annotation efforts or field-relevant papers describing cell types of interest using other methods - bulk RNAseq, FACS…
Useful Annotation Literature
T cells
- Interpretation of T cell states from single-cell transcriptomics data using reference atlases
- Single-cell atlas of healthy human blood unveils age-related loss of NKG2C+GZMB−CD8+ memory T cells and accumulation of type 2 memory T cells
- Single-cell transcriptomics of human T cells reveals tissue and activation signatures in health and disease
- Single-Cell Transcriptomics of Regulatory T Cells Reveals Trajectories of Tissue Adaptation
- CD8 + T cell differentiation and dysfunction in cancer
- CD8 + T cell states in human cancer: insights from single-cell analysis
B cells
- An atlas of cells in the human tonsil
- Single-cell analysis of human B cell maturation predicts how antibody class switching shapes selection dynamics
Monocyte-Macrophages
- Tissue-resident macrophages provide a pro-tumorigenic niche to early NSCLC cells
- Single cell RNA sequencing identifies unique inflammatory airspace macrophage subsets
Neutrophils
- The neutrotime transcriptional signature defines a single continuum of neutrophils across biological compartments
- Cellular and transcriptional dynamics of human neutrophils at steady state and upon stress
- Single-cell transcriptome profiling reveals neutrophil heterogeneity in homeostasis and infection
NKs
- Single-cell transcriptome reveals the novel role of T-bet in suppressing the immature NK gene signature - Suppl. 1
- Functionally distinct subsets of human NK cells and monocyte/DC-like cells identified by coexpression of CD56, CD7, and CD4
- Immune Circuits to Shape Natural Killer Cells in Cancer
DCs
- Cross-Presenting XCR1+ Dendritic Cells as Targets for Cancer Immunotherapy
- Transcriptional Basis of Mouse and Human Dendritic Cell Heterogeneity
Endothelial
Libraries
Load data
We’re going to be working with a dataset from the paper - Immunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19 Download data from cellxgene portal.
# Download the data in data/ directory
# download.file(
# url = "https://datasets.cellxgene.cziscience.com/d8e35450-de43-451a-9979-276eac688bce.rds",
# destfile = "../data/workshop-data.rds",
# method = "wget",
# extra = "-r -p --random-wait")
# We can also use the CLI with the wget command below
# wget https://datasets.cellxgene.cziscience.com/d8e35450-de43-451a-9979-276eac688bce.rds
# se <- readRDS("../data/Covid_Flu_Seurat_Object.rds")
se <- readRDS(file = "../data/clustered_se.rds")Analysis
The seurat object already comes pre-processed from the previous step! We can go right in to annotate the clusters.
Check how the clusters look on the UMAP
se$sample_id <- se$sample_id
DimPlot(
se,
group.by = c(
"RNA_snn_res.0.01", "RNA_snn_res.0.05",
"RNA_snn_res.0.1", "RNA_snn_res.0.25"),
label = TRUE)dim_plt <- DimPlot(
se,
group.by = c("RNA_snn_res.0.05"),
label = TRUE)And the original cell type labels + the sample IDs
DimPlot(
se,
group.by = c("Celltype", "sample_id"),
label = FALSE)For the purpose of this tutorial we’re going to go forward with resolution 0.05!
DGE Wilcoxon
The different implementations Seurat incorporates provides in FindAllMarkers compare the gene expression between 2 groups of cells. This one vs all strategy is very quick and returns the avg_log2FC. This avg_log2FC is computed as detailed here & here. Since we’re working with normalized data the log2FC can be directly computed by subtracting the average expression between both groups - log_{2}(\frac{exp1}{exp2})=log_{2}(Avg\_exp1)-log_{2}(Avg\_exp2)
Idents(se) <- se$RNA_snn_res.0.05
mgs <- FindAllMarkers(
se,
test.use = "wilcox",
slot = "data",
only.pos = TRUE,
logfc.threshold = 0.5,
min.pct = 0.25)Look at the results in a dynamic table:
DT::datatable(mgs, filter = "top")Look at the results in a heatmap
top10 <- mgs %>%
arrange(cluster, desc(avg_log2FC)) %>%
group_by(cluster) %>%
dplyr::filter(avg_log2FC > 1) %>%
slice_head(n = 10) %>%
ungroup()
DoHeatmap(se, features = top10$gene) + NoLegend()Save marker genes to a spreadsheet
mgs_ls <- lapply(unique(mgs$cluster), function(i) {
mgs %>% dplyr::filter(cluster == i)
})
# Set names which will be the sheet name
names(mgs_ls) <- unique(mgs$cluster)
# Save marker genes to spreadsheet
openxlsx::write.xlsx(mgs_ls, file = "../data/lvl1_mgs.xlsx")P-value interpretation
More details can be obtained in OSCA.
P values obtained from DGE analysis are inflated and, therefore invalid in their interpretation. We can’t use p-values to reject the Null Hypothesis since we are carrying out data snooping. This means that we are dividing the clusters based on their gene expression, and then computing p-values for the genes that are differentially expressed, even though we know these clusters have different gene expression patterns since we clustered the data based on them being different.
A way to show this is by looking at how skewed the distributions of the p-values obtained is:
# Compute the p-values without he thresholds
mgs2 <- FindAllMarkers(
se,
test.use = "wilcox",
only.pos = TRUE,
logfc.threshold = 0,
min.pct = 0,
return.thresh = 1,
max.cells.per.ident = 100 # use 100 cells per cell type for speed
)
ggplot(mgs2, aes(x = p_val, fill = cluster, color = cluster)) +
# geom_histogram(alpha = 0.3, position = "identity") +
geom_density(alpha = 0.3) +
theme_minimal()ggplot(mgs2, aes(x = p_val, fill = cluster, color = cluster)) +
geom_histogram(alpha = 0.3, position = "identity") +
facet_wrap(~cluster, scales = "free") +
theme_minimal()Annotation
There are two main ways to annotate your single cell dataset which are usually used together to aid in the process:
Automatic Cell Type Annotation: typically requires a reference dataset representative of the samples we are trying to annotate. Some commonly used tools in R are
SingleRwhich has very good documentation andSeurat’s reference mapping either through their Azimuth server or following their reference mapping vignette which performs remarkably well.Manual Cell Type Annotation: is based on prior biological knowledge of marker genes and literature review of the markers that are differentially expressed in each cluster. Good marker genes for a population are those that have a high log2FC, a high pct.1 (expressed in many cells in that group) and low pct.2 (expression absent in other groups).
Azimuth
We can use Azimuth online app to manually use a reference dataset to annotate our dataset.
To do so we need to save the counts matrix as an RDS file.
# saveRDS(
# # subsample our data to 20% for a quick example
# object = se[, sample(colnames(se), .2 * ncol(se))]@assays$RNA$counts,
# file = "../data/counts.rds")Reference Mapping
We are going to download a reference human PBMC data using the SeuratData package. You can use whichever reference suits your data best.
# SeuratData::AvailableData()
SeuratData::InstallData("pbmc3k", force.reinstall = TRUE)
# http://seurat.nygenome.org/src/contrib/pbmcref.SeuratData_1.0.0.tar.gz
SeuratData::InstalledData() Dataset Version Summary species system ncells tech seurat default.dataset disk.datasets other.datasets notes Installed InstalledVersion
lungref.SeuratData lungref 2.0.0 Azimuth Reference: lung human lung 584884 <NA> <NA> <NA> <NA> <NA> <NA> TRUE 2.0.0
panc8.SeuratData panc8 3.0.2 Eight Pancreas Datasets Across Five Technologies human Pancreatic Islets 14892 SMARTSeq2, Fluidigm C1, CelSeq, CelSeq2, inDrops <NA> raw <NA> <NA> <NA> TRUE 3.0.2
pbmc3k.SeuratData pbmc3k 3.1.4 3k PBMCs from 10X Genomics human PBMC 2700 10x v1 3.1.4 raw <NA> pbmc3k.final <NA> TRUE 3.1.4
pbmcref.SeuratData pbmcref 1.0.0 Azimuth Reference: pbmc human PBMC 2700 10x v1 <NA> <NA> <NA> <NA> <NA> TRUE 1.0.0
pbmcsca.SeuratData pbmcsca 3.0.0 Broad Institute PBMC Systematic Comparative Analysis human PBMC 31021 10x v2, 10x v3, SMARTSeq2, Seq-Well, inDrops, Drop-seq, CelSeq2 <NA> raw <NA> <NA> HCA benchmark TRUE 3.0.0
ref <- LoadData("pbmcsca")Look at the data
table(ref$CellType)
B cell CD14+ monocyte CD16+ monocyte CD4+ T cell Cytotoxic T cell Dendritic cell Megakaryocyte Natural killer cell Plasmacytoid dendritic cell Unassigned
5020 5550 804 7391 9071 433 977 1565 164 46
Preprocess data - Normalize, identify HVG and look at the Elbow plot. We start by normalizing the data just to make sure both datasets are processed the same way. We then use the union of the HVG to make sure we capture the biological variability of both datasets. Next we compute the UMAP, it is necessary to specify return.model=TRUE! Lastly, we look at the Elbow plot to assess how many dimensions to use in downstream analysis.
# Renormalize data and find HVG to make sure both objects are pre-processed equally
se <- NormalizeData(se) %>%
FindVariableFeatures(method = "vst", nfeatures = 3000)
ref <- NormalizeData(ref) %>%
FindVariableFeatures(method = "vst", nfeatures = 3000) %>%
ScaleData() %>%
RunPCA()
ElbowPlot(ref, ndims = 30)# Make sure to return.model=TRUE here! It is necessary for the following steps
ref <- ref %>%
Seurat::RunUMAP(reduction = "pca", dims = 1:30, return.model = TRUE)
# Use the union of HVG
hvg <- union(VariableFeatures(se), VariableFeatures(ref))
length(hvg)[1] 5516
Reference mapping - there is very good explanation in the documentation of ?FindTransferAnchors. Basically what we are doing here is embedding the cells into the same latent space. Anchors - defined as pairs of cells contained within each other’s k-neighborhood - are identified. Lastly, with TransferData the shared nearest neighbors overlap between the anchors and the cells is used to obtain the that cell’s predicted label.
# Find anchors in reference dataers
anchors <- FindTransferAnchors(
reference = ref,
query = se,
dims = 1:30,
reference.reduction = "pca",
features = hvg,
normalization.method = "LogNormalize")
# Extract the predictions
predictions <- TransferData(
anchorset = anchors,
refdata = ref$CellType,
dims = 1:30)
head(predictions) predicted.id prediction.score.Cytotoxic.T.cell prediction.score.CD4..T.cell prediction.score.CD14..monocyte prediction.score.B.cell prediction.score.Megakaryocyte prediction.score.CD16..monocyte prediction.score.Natural.killer.cell prediction.score.Dendritic.cell prediction.score.Plasmacytoid.dendritic.cell prediction.score.Unassigned prediction.score.max
AAACGGGTCATAGCAC-1-4918STDY7273964 CD4+ T cell 0.000000000 1.0000000 0.000000000 0.000000000 0 0.000000000 0 0.0000000 0 0.0000000 1.0000000
AAAGATGAGTCCAGGA-1-4918STDY7273964 CD4+ T cell 0.001985238 0.5051199 0.032647963 0.155932137 0 0.007787762 0 0.1647111 0 0.1318159 0.5051199
AAAGATGCAAGCCCAC-1-4918STDY7273964 CD4+ T cell 0.023293256 0.9700201 0.000000000 0.006686657 0 0.000000000 0 0.0000000 0 0.0000000 0.9700201
AAAGATGTCAGTCAGT-1-4918STDY7273964 CD4+ T cell 0.000000000 1.0000000 0.000000000 0.000000000 0 0.000000000 0 0.0000000 0 0.0000000 1.0000000
AAAGATGTCTAACTTC-1-4918STDY7273964 Dendritic cell 0.000000000 0.2497122 0.002452702 0.004941977 0 0.000000000 0 0.4912594 0 0.2516337 0.4912594
AAAGCAACAACTGGCC-1-4918STDY7273964 B cell 0.000000000 0.0000000 0.000000000 1.000000000 0 0.000000000 0 0.0000000 0 0.0000000 1.0000000
Let’s add these predictions to our seurat object
# colnames(predictions)
se <- AddMetaData(
se,
metadata = predictions[, c("predicted.id", "prediction.score.max")])Look at the label transfer
DimPlot(se, group.by = c("annotation_V2", "predicted.id"), label = TRUE) & NoLegend()Since we also got a confidence value for each cell we can visualize it
FeaturePlot(se, features = c("prediction.score.max", "prediction.score.Unassigned"))Manual Annotation
Cluster 0
Let’s look at genes that are differentially expressed
egenes <- c("EPCAM", "CLDN7", "KRT19", "CLDN3", "KRT8", "MUC3A")
FeaturePlot(
se,
features = egenes,
ncol = 3) +
dim_pltVlnPlot(
se,
features = egenes,
group.by = "RNA_snn_res.0.05") +
dim_pltCluster 1
Let’s look at genes that are differentially expressed
tgenes <- c("CD3D", "CD3E", "TRAC", "TRBC2", "CD8B", "CD4")
FeaturePlot(
se,
features = tgenes,
ncol = 3) +
dim_pltVlnPlot(
se,
features = tgenes,
group.by = "RNA_snn_res.0.05") +
dim_pltClusters 0 look like T cells
Cluster 2
Let’s look at genes that are differentially expressed
pgenes <- c("MZB1", "CD79A", "XBP1", "IGHA2", "IGHA1", "JCHAIN")
FeaturePlot(
se,
features = pgenes,
ncol = 3) +
dim_pltVlnPlot(
se,
features = pgenes,
group.by = "RNA_snn_res.0.05") +
dim_pltCluster 2 are Plasma Cells.
Cluster 3 & 9
Let’s look at genes that are differentially expressed
fgenes <- c("LUM", "DCN", "COL1A2", "COL3A1", "COL6A2", "IGFBP7", "MFAP4")
FeaturePlot(
se,
features = fgenes,
ncol = 3) +
dim_pltVlnPlot(
se,
features = fgenes,
group.by = "RNA_snn_res.0.05") +
dim_pltCluster 3 are fibroblasts
Cluster 4 & 5
Let’s look at genes that are differentially expressed
bgenes <- c("MS4A1", "CD79A", "CD79B", "IGHD", "IGHM")
FeaturePlot(
se,
features = bgenes,
ncol = 3) +
dim_pltVlnPlot(
se,
features = bgenes,
group.by = "RNA_snn_res.0.05") +
dim_pltCluster 4 is expressing B cell genes
Cluster 6
Let’s look at genes that are differentially expressed
mgenes <- c("CD14", "FCGR3A", "S100A8", "VCAN", "LYZ")
FeaturePlot(
se,
features = mgenes,
ncol = 3) +
dim_pltVlnPlot(
se,
features = mgenes,
group.by = "RNA_snn_res.0.05") +
dim_pltCluster 6 look like myeloid cells
Cluster 7
Let’s look at genes that are differentially expressed
ggenes <- c("KLF4", "TFF1", "SYTL2")
FeaturePlot(
se,
features = ggenes,
ncol = 3) +
dim_pltVlnPlot(
se,
features = ggenes,
group.by = "RNA_snn_res.0.05") +
dim_pltCluster 7 look like Goblet cells
Cluster 8
Let’s look at genes that are differentially expressed
vgenes <- c("RAMP2", "PLVAP", "PECAM1", "VWF", "CD34", "CAV1")
FeaturePlot(
se,
features = vgenes,
ncol = 3) +
dim_pltVlnPlot(
se,
features = vgenes,
group.by = "RNA_snn_res.0.05") +
dim_pltCluster 8 is expressing vascular endothelial cell markers
Cluster 10
Let’s look at genes that are differentially expressed
lgenes <- c("CCL21", "LYVE1", "PROX1", "CLDN5")
FeaturePlot(
se,
features = lgenes,
ncol = 3) +
dim_pltVlnPlot(
se,
features = lgenes,
group.by = "RNA_snn_res.0.05") +
dim_pltCluster 10 is expressing lymphatic endothelial cell markers
Cluster 11
Let’s look at genes that are differentially expressed
pegenes <- c("RGS5", "NOTCH3", "SOD3", "GJA4", "MGP", "ACTA2")
FeaturePlot(
se,
features = pegenes,
ncol = 3) +
dim_pltVlnPlot(
se,
features = pegenes,
group.by = "RNA_snn_res.0.05") +
dim_pltCluster 11 is expressing pericyte markers
Annotate
According to the markers observed we can make a first general annotation
se@meta.data <- se@meta.data %>%
dplyr::mutate(
annotation_lvl1 = dplyr::case_when(
RNA_snn_res.0.05 == 0 ~ "Epithelial",
RNA_snn_res.0.05 == 1 ~ "T cells",
RNA_snn_res.0.05 == 2 ~ "Plasma Cells",
RNA_snn_res.0.05 == 3 ~ "Fibroblasts",
RNA_snn_res.0.05 == 4 ~ "B cells",
RNA_snn_res.0.05 == 5 ~ "B cells",
RNA_snn_res.0.05 == 6 ~ "Myeloid cells",
RNA_snn_res.0.05 == 7 ~ "Goblet cells",
RNA_snn_res.0.05 == 8 ~ "Vascular Endothelial cells",
RNA_snn_res.0.05 == 9 ~ "Fibroblasts",
RNA_snn_res.0.05 == 10 ~ "Lymphatic Endothelial cells",
RNA_snn_res.0.05 == 11 ~ "Pericytes"
)
)
DimPlot(se, group.by = "annotation_lvl1")Summary genes
We can visualize this as a dotplot
order <- unique(c(
"T cells", "Myeloid cells", "B cells", "Plasma Cells",
"Epithelial", "Goblet cells", "Fibroblasts",
"Pericytes", "Vascular Endothelial cells", "Lymphatic Endothelial cells"
))
se$annotation_lvl1_ord <- factor(
x = se$annotation_lvl1,
levels = unique(order))
## Genes for DOTPLOT
dplot_genes <- c(
# T cell genes
tgenes,
# Monocytes
mgenes,
# B cells
bgenes,
# Plasma Cells
pgenes,
# Epithelial
egenes,
# Goblet cells
ggenes,
# Fibroblasts
fgenes,
# Pericytes
pegenes,
# Vascular Endothelial cells
vgenes,
# Lymphatic Endothelial cells
lgenes
)
dplot_genes <- unique(dplot_genes)
Seurat::DotPlot(
object = se,
features = dplot_genes,
group.by = "annotation_lvl1_ord",
col.min = 0,
dot.min = 0) +
ggplot2::scale_x_discrete(
breaks = dplot_genes) +
ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 60, hjust = 1)) +
ggplot2::labs(x = "", y = "")We can also visualize this as a heatmap using Seurat’s DoHeatmap function:
Seurat::DoHeatmap(
se,
features = dplot_genes,
group.by = "annotation_lvl1_ord")Or with pheatmap
# Cell types in the order the gdotplot genes are set
ct_vec <- order
lvl1_pal <- c(
"T cells" = "#228B22", # Forest Green
"Myeloid cells" = "#DAA520", # Goldenrod
"B cells" = "#4682B4", # Steel Blue
"Plasma Cells" = "#6A5ACD", # Slate Blue
"Epithelial" = "#FF8C00", # Dark Orange
"Goblet cells" = "#DA70D6", # Orchid
"Fibroblasts" = "#8B4513", # Saddle Brown
"Pericytes" = "#3CB371", # Medium Sea Green
"Vascular Endothelial cells" = "#DC143C", # Crimson
"Lymphatic Endothelial cells" = "#008080" # Teal
)
# Subset to 30% of the dataset
se_30 <- se[, sample(colnames(se), 0.5 * ncol(se))]
hm_ls <- lapply(ct_vec, function(i) {
se_sub <- se_30[, se_30$annotation_lvl1 == i]
# Extract Gene Expression Matrix from Seurat Object
gene_expr <- GetAssayData(se_sub, assay = "RNA", slot = "scale.data")
# Subset the genes intersecting between gene expression and genes of interest
g_int <- dplot_genes[dplot_genes %in% rownames(gene_expr)]
# Subset expression matrix to only genes of interest
gene_expr <- gene_expr[g_int, ]
# Add the score of the signature as annotation in the heatmap
colAnn <- ComplexHeatmap::HeatmapAnnotation(
df = se_sub@meta.data[, c("annotation_V2", "annotation_lvl1"), drop = FALSE],
# name = "Celltype",
which = 'column',
col = list("annotation_lvl1" = lvl1_pal),
show_annotation_name = FALSE)
# Visualize the Heatmap with the genes and signature
ComplexHeatmap::Heatmap(
as.matrix(gene_expr),
name = "Scaled Gene Expression",
# col = expr_cols,
cluster_rows = FALSE,
cluster_columns = TRUE,
# column_title = sig_name,
column_names_gp = gpar(fontsize = 14),
show_column_names = FALSE,
top_annotation = colAnn,
)
})
# Return ComplexHeatmap
(plt1 <- (hm_ls[[1]] + hm_ls[[2]] + hm_ls[[3]] + hm_ls[[4]] + hm_ls[[5]]))(plt2 <- (hm_ls[[6]] + hm_ls[[7]] + hm_ls[[8]] + hm_ls[[9]] + hm_ls[[10]]))Annotation agreement
Lastly let’s see if our Manual annotation agrees with the reference annotation:
DimPlot(
se,
group.by = c("annotation_lvl1", "predicted.id", "annotation_V2"),
ncol = 2)We can also check the overlap between manual and automatic annotation as follows:
se@meta.data %>%
# Count the instances each combination of annotations happen
dplyr::count(annotation_lvl1, predicted.id) %>%
ggplot(aes(x = annotation_lvl1, y = predicted.id, fill = n, color = n, label = n)) +
geom_tile(color = "lightgrey") +
geom_text(color = "lightgrey") +
scale_fill_viridis_c() +
theme_classic() +
theme(legend.position = "none")Check the overlap between manual and the author provided annotation:
se@meta.data %>%
# Count the instances each combination of annotations happen
dplyr::count(annotation_lvl1, annotation_V2) %>%
ggplot(aes(x = annotation_lvl1, y = annotation_V2, fill = n, color = n, label = n)) +
geom_tile(color = "lightgrey") +
geom_text(color = "lightgrey") +
scale_fill_viridis_c() +
theme_classic() +
theme(legend.position = "none")Now with an alluvial plot
se@meta.data %>%
dplyr::count(annotation_lvl1, predicted.id, annotation_V2) %>%
ggplot(
aes(axis1 = annotation_lvl1, axis2 = predicted.id, axis3 = annotation_V2,
y = n)) +
scale_x_discrete(limits = c("annotation_lvl1", "predicted.id", "annotation_V2"), expand = c(.2, .05)) +
geom_alluvium(aes(fill = annotation_lvl1), alpha = 0.9, aes.bind = TRUE) +
geom_stratum() +
geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
labs(
title = "Cell type annotation lables across strategies",
x = "Annotations") +
theme_minimal() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "darkgrey")) +
scale_fill_manual(values = lvl1_pal)Save objects
saveRDS(object = se, file = "../data/se_lvl1.rds")Playground with cluster 5
Cluster 5 seems to have a particularly low Silhouette analysis. This potentially means these cells are not clustered correctly. This could be due to this cluster being under-clustered - leading to multiple cell types/states ending up within it. Let’s check if there is heterogeneity within cluster 5.
Idents(se) <- se$RNA_snn_res.0.05
se <- FindSubCluster(se, cluster = 5, resolution = c(0.1), graph.name = "RNA_snn")
se_5 <- se[, se$RNA_snn_res.0.05 == 5]
DimPlot(se_5, group.by = "sub.cluster")
## It looks like it has found 5 subclusters on SNN embedding space
Idents(se_5) <- se_5$sub.cluster
mgs <- FindAllMarkers(
se_5,
only.pos = TRUE,
logfc.threshold = 0.5,
min.pct = 0.25)
DT::datatable(mgs, filter = "top")Lastly, we should look at cells that might be between cell types or proliferating. These are often cells in transition and can be clustered across the UMAP.
FeaturePlot(
se_5,
features = c(
# T cell genes
"TRDC", "TRGC1", "TRGC2",
"TRAC", "TRBC1", "TRBC2",
# Proliferation genes
"TOP2A", "MKI67", "STMN1",
# Myeloid genes
"S100A8", "S100A9", "HBB",
# B cell genes
"JCHAIN", "CD79A", "MS4A1",
"IGHA1", "MZB1", "IGHG1", "IGHG3",
# NK genes
"FCGR3A", "KLRF1", "KLRC2",
"NCR1", "NCR2", "NCR3"),
ncol = 3
)Extra!!!
See below how the avg_log2FC calculation is done! Code extracted from Seurat’s codebase.
features <- rownames(se) == "MS4A1"
cells.1 <- se$Celltype == "B cell, IgG+"
cells.2 <- se$Celltype != "B cell, IgG+"
data.use <- GetAssayData(object = se, assay.type = "RNA", slot = "data")
pseudocount.use <- 1
base <- 2
# Calculate fold change
mean.fxn <- function(x) {
return(log(x = (rowSums(x = expm1(x = x)) + pseudocount.use)/NCOL(x), base = base))
}
data.1 <- mean.fxn(data.use[features, cells.1, drop = FALSE])
data.2 <- mean.fxn(data.use[features, cells.2, drop = FALSE])
# Look at log2FC
(fc <- (data.1 - data.2))Check if its equal to the avg_log2FC obtained from FindAllMarkers:
fc == mgs[mgs$cluster == "B cell, IgG+" & mgs$gene == "MS4A1", "avg_log2FC"]Session Info
sessionInfo()R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.3.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] pbmc3k.SeuratData_3.1.4 ggalluvial_0.12.5 openxlsx_4.2.7.1 pbmcsca.SeuratData_3.0.0 pbmcref.SeuratData_1.0.0 panc8.SeuratData_3.0.2 lungref.SeuratData_2.0.0 SeuratData_0.2.2.9001 ComplexHeatmap_2.18.0 DT_0.33 RColorBrewer_1.1-3 colorBlindness_0.1.9 lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0 Seurat_5.2.1 SeuratObject_5.0.2 sp_2.1-4
loaded via a namespace (and not attached):
[1] RcppAnnoy_0.0.22 splines_4.3.1 later_1.4.1 polyclip_1.10-7 fastDummies_1.7.5 lifecycle_1.0.4 doParallel_1.0.17 globals_0.16.3 lattice_0.22-6 MASS_7.3-60 crosstalk_1.2.1 magrittr_2.0.3 sass_0.4.9 limma_3.58.1 plotly_4.10.4 rmarkdown_2.29 jquerylib_0.1.4 yaml_2.3.10 httpuv_1.6.15 sctransform_0.4.1 spam_2.11-0 zip_2.3.1 spatstat.sparse_3.1-0 reticulate_1.40.0 cowplot_1.1.3 pbapply_1.7-2 abind_1.4-8 Rtsne_0.17 presto_1.0.0 BiocGenerics_0.48.1 rappdirs_0.3.3 circlize_0.4.16 IRanges_2.36.0 S4Vectors_0.40.2 ggrepel_0.9.6 irlba_2.3.5.1 listenv_0.9.1 spatstat.utils_3.1-2 goftest_1.2-3 RSpectra_0.16-2 spatstat.random_3.3-2 fitdistrplus_1.2-2 parallelly_1.41.0 codetools_0.2-20 tidyselect_1.2.1 shape_1.4.6.1 farver_2.1.2 matrixStats_1.5.0 stats4_4.3.1 spatstat.explore_3.3-4 jsonlite_1.8.9
[52] GetoptLong_1.0.5 progressr_0.15.1 ggridges_0.5.6 survival_3.8-3 iterators_1.0.14 foreach_1.5.2 tools_4.3.1 ica_1.0-3 Rcpp_1.0.14 glue_1.8.0 gridExtra_2.3 xfun_0.50 withr_3.0.2 fastmap_1.2.0 digest_0.6.37 timechange_0.3.0 R6_2.5.1 mime_0.12 gridGraphics_0.5-1 colorspace_2.1-1 Cairo_1.6-2 scattermore_1.2 tensor_1.5 spatstat.data_3.1-4 generics_0.1.3 data.table_1.16.4 httr_1.4.7 htmlwidgets_1.6.4 uwot_0.1.16 pkgconfig_2.0.3 gtable_0.3.6 lmtest_0.9-40 htmltools_0.5.8.1 dotCall64_1.2 clue_0.3-66 scales_1.3.0 png_0.1-8 spatstat.univar_3.1-1 knitr_1.49 rstudioapi_0.17.1 tzdb_0.4.0 reshape2_1.4.4 rjson_0.2.23 nlme_3.1-166 cachem_1.1.0 zoo_1.8-12 GlobalOptions_0.1.2 KernSmooth_2.23-26 vipor_0.4.7 parallel_4.3.1 miniUI_0.1.1.1
[103] ggrastr_1.0.2 pillar_1.10.1 vctrs_0.6.5 RANN_2.6.2 promises_1.3.2 xtable_1.8-4 cluster_2.1.8 beeswarm_0.4.0 evaluate_1.0.3 magick_2.8.5 cli_3.6.3 compiler_4.3.1 rlang_1.1.4 crayon_1.5.3 future.apply_1.11.3 labeling_0.4.3 ggbeeswarm_0.7.2 plyr_1.8.9 stringi_1.8.4 viridisLite_0.4.2 deldir_2.0-4 munsell_0.5.1 lazyeval_0.2.2 spatstat.geom_3.3-4 Matrix_1.6-5 RcppHNSW_0.6.0 hms_1.1.3 patchwork_1.3.0 future_1.34.0 statmod_1.5.0 shiny_1.10.0 ROCR_1.0-11 igraph_2.1.2 bslib_0.8.0